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ABSTRACT 



OO 
O . 

We describe a Bayesian approach to estimating luminosity functions. We derive the likelihood func- 
{SJ ■ tion and posterior probability distribution for the luminosity function, given the observed data, and 

we compare the Bayesian approach with maximum-likelihood by simulating sources from a Schechter 
function. For our simulations confidence intervals derived from bootstrapping the maximum-likelihood 
estimate can be too narrow, while confidence intervals derived from the Bayesian approach are valid. 
We develop our statistical approach for a flexible model where the luminosity function is modeled as 
a mixture of Gaussian functions. Statistical inference is performed using Markov chain Monte Carlo 
(MCMC) methods, and we describe a Metropolis-Hastings algorithm to perform the MCMC. The 
MCMC simulates random draws from the probability distribution of the luminosity function param- 
eters, given the data, and we use a simulated data set to show how these random draws may be used 
■ to estimate the probability distribution for the luminosity function. In addition, we show how the 

MCMC output may be used to estimate the probability distribution of any quantities derived from 
the luminosity function, such as the peak in the space density of quasars. The Bayesian method we 
develop has the advantage that it is able to place accurate constraints on the luminosity function even 
beyond the survey detection limits, and that it provides a natural way of estimating the probabil- 
ity distribution of any quantities derived from the luminosity function, including those that rely on 
information beyond the survey detection limits. 

Subject headings: galaxies: luminosity function — methods: data analysis — methods: numerical — 
methods: statistical 
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1. INTRODUCTION 

The luminosity function (LF) has been an important tool for understanding the evolution of galaxies and quasars, as 
j \ it provides a census of the galax y and quasar populations oyer cosmic time. Quasar luminosity functions have been es- 
timat ed for optical surveys (e.g..lFan et al.ll2001b IWolf et al.|[200l ICroom et al.ll2004l: [Richards et al.ll2006t Uiang et all 
120061 ). X-r av surveys (e.g..lSteffen et al.ll2003t lUeda et al.ll2003HBarger et al.ll2005HLa Franca et alJl2005l), infrared sur- 
veys (e.g.. iBarger et al.ll2005HMatute et al.ll2006t [Babbedge et al.l l 2006h . radio surveys (e.g.. IWaddington et al.ll2001t 
IWiliott et alJl200C and emission lines (|Hao et al.l l2005) . In addition, lumi nosity functions acro ss different bands have 
been combined to form an estimate of the bolometric luminosity function (|Hopkins et al.ll2007f ) . Besides providing an 
important constraint on models of quasar evolution and supermassive black hole growth (e.g.. IWvithe fc Loebl l2003: 
iHopkins et al.ll2006al) . studies of the LF have found evidence for 'cosmic downsizing', where the space density of more 
luminous quasars peaks at higher redshift. Attempts to map the growt h of sup e rmass ive black holes start from the 
local supermassive black hole distribution, and employ the argument of ISoltanl <|1982f l . using the quasar luminosity 
function as a constraint on the black hole mass distribution. These studies have foun d evidence that the highest 
mass black holes grow first (e.g., Yu & Tremaine 2002; Marconi et al. 2004; Merloni 2004), suggesting that this cosmic 

downsizing is the result of an anti-hierarchical growth of supermassive black h oles. 

Similarly, galaxy luminosity functions h ave been esti mated in the optical (e.g..lBlanton et ail f2003: Dah len et al.ll2005l: 
' Brown et al.ll2007tlAfarchesini et al.l2007l) . X-ray (e.g.jKim et al.l2006HPtak et alj2007l) . infrared (e gJCirasuolo et all 
20071: iHuvnh et alj|2007l ), ultraviolet (e.g.. iBudavari et al.ll200a; iPaltani et al.ll2007D. radio (e.g.. iLin fc Mohrll2007t 
Mauch fe S adler 2007 ), for galaxies in clusters (e.g.. lPopesso et al. I I2006H H arsono fc de Propris1l2007T r and for galaxies 



in voids (H ovle et 



20051 ). The galaxy luminosity function probes several aspects of the gala xy population; namely 



(a) the evolution of stellar populations a nd star formation histories (e.g.. iFaber et a.1.1 12007T ) . (b) the local super- 
massive black hole ma ss distribution (e.g. lYu fc Tremainei 120021 : iMarconi et al.ll2004f ) via the Magorrian relationship 
dMagorrian et al.l fl998). (c) the dependence of galaxy properties on environment (e.g.. lCroton et al 1120051: lLauer et all 
20071) . and (d) places constraints on models of stru cture formation and galaxy evolution (e.g.. iBower et al.l 120061 : 



Finlator et ail 2006: Marchesini fc van Dokkuml l2007). 



Given the importance of the luminosity function as an observational constraint on models of quasar and galaxy 
evolution, it is essential that a statistically accurate approach be employed when estimating these quantities. However, 
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the existence of complicated selection functions hinders this, and, as a result, a variety of methods have been used 
to ac curately account for the selection function when estimatin g the LF. These include various bi nning methods 
(e.g.. ISchmidlll968t lAvni fe Bahcalilll980t IPage fc Carreral |2000H . maximum-likelihood fitting (e.g., iMarshall et all 
119831: iFan et all I2001h 7 and a powerful semi-parametric approach (|Schaferl l2007f ). In addition, there have been a 
variety of methods proposed for estimating the cum ulative distribution function of the LF (e.g., iLvnden- Bcll 1971; 
lEfron fc Petrosianlfl992l iMalonev fc Petrosianlll999t ). 

Each of these statistical methods has advantages and disadvantages. Statistical inference based on the binning 
procedures cannot be extended beyond the support of the selection function, and the cumulative distribution function 
methods typically assume that luminosity and redshift are statistically independent. Furthermore, one is faced with 
the arbitrary choice of bin size. The maximum-likelihood approach typically assumes a restrictive and somewhat ad 
hoc parametric form, and has not been used to give an estimate of the LF normalization; instead, for example, the 
LF normalization is often chosen to make the expected number of sources detected in one's survey equal to the actual 
number of sources detected. In addition, confidence intervals based on the errors derived from the various procedures 
are typically derived by assuming that the uncertainties on the LF parameters have a Gaussian distribution. While 
this is valid as the sample size approaches infinity, it is not necessarily a good approximation for finite sample sizes. 
This is particularly problematic if one is employing the best fit results to extrapolating the luminosity function beyond 
the bounds of the selection function. It is unclear if the probability distribution of the uncertainty in the estimated 
luminosity function below the flux limit is even asymptotically normal. 

Motivated by these issues, we have developed a Bayesian method for estimating the luminosity function. We derive 
the likelihood function of the LF by relating the observed data to the true LF, assuming some parametric form, and 
derive the posterior probability distribution of the LF parameters, given the observed data. While the likelihood 
function and posterior are valid for any parametric form, we focus on a flexible parametric model where the LF is 
modeled as a weighted sum of Gaussian functions. This is a type of 'non-parametric' approach, where the basic 
idea is that the individual Gaussian functions do not have any physical meaning, but that given enough Gaussian 
functions one can ob tain a suitably accurate approximat ion to the tr ue LF; a similar approach has been taken by 
iBlanton et al.l (|2003l ) for estimating galaxy LFs, and by Kelly (2007) within the context of linear regression with 
measurement error. Modeling the LF as a mixture of Gaussian functions avoids the problem of choosing a particular 
parametric form, especially in the absence of any guidance from astrophysical theory. The mixture of Gau s sians model 
has been studied from a Bayesian pers pective by numerous authors (e.g., iRoeder fc Wassermanl 119971 Uasra et all 
l2005t [Pellaportas fc Pa pagcorgiou 2006). In addition, we describe a Markov chain Monte Carlo (MCMC) algorithm 
for obtaining random draws from the posterior distribution. These random draws allow one to estimate the posterior 
distribution for the LF, as well as any quantities derived from it. The MCMC method therefore allows a straight- 
forward method of calculating uncertainties on any quantity derived from the LF, such as the redshift where the space 
density of quasars or galaxies peaks; this has proven to be a challenge for other statistical methods developed for 
LF estimation. Because the Bayesian approach is valid for any sample size, one is therefore able to place reliable 
constraints on the LF and related quantities even below the survey flux limits. 

Because of the diversity and mathematical complexity of some parts of this paper, we summarize the main results 
here. We do this so that the reader who is only interested in specific aspects of this paper can conveniently consult 
the sections of interest. 

• In § [2] we derive the general form of the likelihood function for luminosity function estimation. We show that 
the commonly used likelihood function based on the Poisson distribution is incorrect, and that the correct form 
of the likelihood function is derived from the binomial distribution. However, because the Poisson distribution 
is the limit of the binomial distribution as the probability of including a source in a survey approaches zero, 
the maximum-likelihood estimates derived from the two distribution give nearly identical results so long as a 
survey's detection probability is small. The reader who is interested in using the correct form of the likelihood 
function of the LF should consult this section. 

• In § |3] we describe a Bayesian approach to luminosity function estimation. We build on the likelihood function 
derived in § [2] to derive the probability distribution of the luminosity function, given the observed data (i.e., 
the posterior distribution). We use a simple example based on a Schechter function to illustrate the Bayesian 
approach, and compare it with the maximum-likelihood approach. For this example, we find that confidence 
intervals derived from the posterior distribution are valid, while confidence intervals derived from bootstrapping 
the maximum-likelihood estimate can be too small. The reader who is interested in a Bayesian approach to 
luminosity function estimation, and how it compares with maximum-likelihood, should consult this section. 

• In § H] we develop a mixture of Gaussian functions model for the luminosity function, deriving the likelihood 
function and posterior distribution for the model. Under this model, the LF is modeled as a weighted sum 
of Gaussian functions. This model has the advantage that given a suitably large enough number of Gaussian 
functions, it is flexible enough to give an accurate estimate of any smooth and continuous LF. This allows the 
model to adapt to the true LF, thus minimizing the bias that can result when assuming a parametric form of 
the LF. This is particularly useful when extrapolating beyond the flux limits of a survey, where bias caused by 
parametric misspecification can be a significant concern. The reader who are interested in employing the mixture 
of Gaussian functions model should consult this section. 
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• Because of the large number of parameters often associated with luminosity function estimation, Bayesian in- 
ference is most easily performed by obtaining random draws of the LF from the posterior distribution. In § [5] 
we describe the Metropolis-Hastings algorithm (MHA) for obtaining random draws of the LF from the posterior 
distribution. As an example, we describe a MHA for obtaining random draws of the parameters for a Schechter 
function from the posterior distribution. Then, we describe a more complex MHA for obtaining random draws 
of the parameters for the mixture of Gaussian functions model. The reader who is interested in the computa- 
tional aspects of 'fitting' the mixture of Gaussian functions model, or who is interested in the computational 
aspects of Bayesian inference for the LF, should consult this section. A computer routine for performing the 
Metropolis-Hastings algorithm for the mixture of Gaussian functions model is available on request from B. Kelly. 

• In § [S] we use simulation to illustrate the effectiveness of our Bayesian Gaussian mixture model for luminosity 
function estimation. We const ruct a simulated data set similar to the Sloan Digital Sky Survey DR3 Quasar 
Catalog (|Schneider et a 1. 1 [2() (),">;■. We then use our mixture of Gaussian functions model to recover the true LF 
and show that our mixture model is able to place reliable constraints on the LF. We also illustrate how to use 
the MHA output to constrain any quantity derived from the LF, and how to use the MHA output to assess the 
quality of the fit. The reader who is interested in assessing the effectiveness of our statistical approach, or who 
is interested in using the MHA output for statistical inference on the LF, should consult this section. 

We a dopt a cosmology based on the the WMAP best-fit parameters (h = 0.71, Q m — 0.27, Qa = 0.73. ISpergel et aTl 
l2003h 

2. THE LIKELIHOOD FUNCTION 
2.1. Notation 

We use the common statistical notation that an estimate of a quantity is denoted by placing a 'hat' above it; e.g., 
9 is an estimate of the true value of the parameter 9. The parameter 9 may be scalar or multivalued. We denote a 
normal density 1 (i.e., a Gaussian distribution) with mean /i and variance a 2 as N(fj,,a 2 ), and we denote as AT p (/x,S) 
a multivariate normal density with p-element mean vector /x and p x p covariance matrix E. If we want to explicitly 
identify the argument of the Gaussian function, we use the notation N(x\/i, a 2 ), which should be understood to be a 
Gaussian with mean p and variance a 2 as a function of x. We will often use the common statistical notation where 
"<~" means "is drawn from" or "is distributed as" . This should not be confused with the common usage of implying 
"similar to" . For example, x ~ N(fi, a 2 ) states that x is drawn from a normal density with mean /j, and variance a 2 , 
whereas x ~ 1 states that the value of x is similar to one. 

In this work, the maximum-likelihood estimate of the luminosity function refers to an estimate of the LF obtained 
by maximizing the likelihood function of the unbinned data. Therefore, the maximum-likelihood estimate does not 
refer to an estimate obtained by maximizing the likelihood function of binned data, such as fitting the results obtained 
from the 1/14 technique. 

2.2. Derivation of the Luminosity Function Likelihood 

The luminosity function, denoted as <fi(L, z)dL, is the number of sources per comoving volume V(z) with luminosities 
in the range L, L + dL. The luminosity function is related to the probability density of (L, z) by 

p(L,z) = ±<P(L,z)^, (1) 

where N is the total number of sources in the observable universe, and is given by the integral of <fi over L and V{z). 
Note that p(L, z)dLdz is the probability of finding a source in the range L,L + dL and z, z + dz. Equation |T]) separates 
the LF into its shape, given by p(L, z), and its normalization, given by N. Once we have an estimate of p(L, z), we can 
easily convert this to an estimate of tf>(L, z) using Equation |T]). In general, it is easier to work with the probability 
distribution of L and z, instead of directly with the LF, because p(L, z) is more directly related to the likelihood 
function. 

If we assume a parametric form for <f>(L, z), with parameters 9, we can derive the likelihood function for the observed 
data. The likelihood function is the probability of observing one's data, given the assumed model. The presence of 
flux limits and various other selection effects can make this difficult, as the observed data likelihood function is not 
simply given by Equation (flj . In this case, the set of luminosities and redshifts observed by a survey gives a biased 
estimate of the true underlying distribution, since only those sources with L above the flux limit at a given z are 
detected. In order to derive the observed data likelihood function, it is necessary to take the survey's selection method 
into account. This is done by first deriving the joint likelihood function of both the observed and unobserved data, 
and then integrating out the unobserved data. 

Because the data points are independent, the likelihood function for all N sources in the universe is 

N 

p(L,z\9) = l[p(L l ,z l \9). (2) 

<=i 

1 We use the terms probability density and probability distribution interchangeably. 
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In reality, we do not know the luminosities and redshifts for all N sources, nor do we know the value of N, as our 
survey only covers a fraction of the sky and is subject to a selection function. As a result, our survey only contains n 
sources. Because of this, the selection process must also be included in the probability model, and the total number 
of sources, N, is an additional parameter that needs to be estimated. 

We can incorporate the sample selection into the likelihood function by including the random detection of sources. 
We introduce an TV-element indicator vector I that takes on the values Ii = 1 if the i th source is included in our survey 
and Ii = otherwise. Note that I is a vector of size N containing only ones and zeros. In this case, the selection 
function is the probability of including a source given L and z, p{Ii — l\Li,Zi). The complete data likelihood is then 
the probability that all objects of interest in the universe (e.g., all quasars) have luminosities L\, . . . , Lpf and redshifts 
Zx, ■ ■ ■ , Zjsr, and that the selection vector / has the values I±, . . . , In, given our assumed luminosity function: 

p(L,z,I\0,N) = C% 11 p(I i = l\L i ,z i )p(L h z i \9) J[ pft = Q\L h Zj )p(L h z s \9). (3) 
ieAob S jeAmis 

Here, C% = N\/n\{N — n)\ is the binomial coefficient, A obs denotes the set of n included sources, and Amis denotes 
the set of N — n missing sources. The number of sources detected in a survey is random, and therefore the binomial 
coefficient is necessary in normalizing the likelihood function, as it gives the number of possible ways to select a subset 
of n sources from a set of N total sources. 

Because we are interested in the probability of the observed data, given our assumed model, the complete data 
likelihood function is of little use by itself. However, we can integrate Equation ([3]) over the missing data to obtain 
the observed data likelihood function. This is because the marginal probability distribution of the observed data is 
obtained by integrating the joint probability distribution of the observed and the missing data over the missing data: 



x 

j£A„ 



p(L obs ,z obs ,I\9,N)=C% Yl p(I i = l\L i ,z i )p{L i ,z i \9) (4) 

ieA oba 

/>oc y* oo 

IT / / P{lj =Q\L j ,z ] )p(L j ,z ] \6) dL 3 dzj (5) 
eA ■ •'° ^° 

ieA oba 

where the probability that the survey misses a source, given the parameters 8, is 

p(I = Q\9) = J J p(I = 0\L,z)p(L,z\9) dL dz. (7) 

Here, we have introduced the notation that L obs and z b s denote the set of values of L and z for those sources included 
in one's survey, and we have omitted terms that do not depend on or N from Equation ([6]). Equation ([6]) is the 
observed data likelihood function, given an assumed luminosity function (Eq.[T]). Qualitatively, the observed data 
likelihood function is the probability of observing the set of n luminosities L\,...,L n and redshifts zi, . . . ,z n given the 
assumed luminosity function parameterized by 9, multiplied by the probability of not detecting N — n sources given 9, 
multiplied by the number of ways of selecting a subset of n sources from a set of N total sources. The observed data 
likelihood function can be used to calculate a maximum likelihood estimate of the luminosity function, or combined 
with a prior distribution to perform Bayesian inference. 

2.3. Comparison with the Poisson Likelihood 

The observed data likelihood given by Equation ([6]) differs from that commonly used in the lum i nosity function 
literature. Instead, a likelihood based on the Poisson distribution is often used. Marshall et al. ( 1983) give the 
following equation for the log-likelihood function based on the Poisson distribution: 

log P (L obs ,z obs \e,N) = J2 log</>(Li,Zi\N,0)+ J [p(I=l\L,z)<t>(L,z\6,N)^jdLdz. (8) 

ieA obs 

Inserting Equation ([TJ) for <p(L, z\ff), the log-likelihood based on the Poisson likelihood becomes 

logp(L obs ,z obs \e,N) = nlogN+ ^ Iogp(i<, m\9) - Np(I = 1\9), (9) 

ieAobs 

where, p(I = 1\6) = 1 — p(I = 0\9), and p(I = 0\9) is given by Equation ([7]). In contrast, the log-likelihood we have 
derived based on the binomial distribution is the logarithm of Equation ^ : 

logp(L obs ,z obs \e,N)=logN!-logn!-\og(N-n)\+ ^ logp(L i} Zi\9) + (TV — n) logp(J = O|0). (10) 

ieA ob , 

The likelihood functions implied by Equations © and (fTT)]) are functions of N, and thus the likelihoods may also be 
maximized with respect to the LF normalization. This is contrary to what is often claimed in the literature, where 
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the LF normalization is typically chosen to make the expected number of sources observed in one's survey equal to 
the actual number observed. 

The binomial likelihood, given by Equation contains the term C„ , resulting from the fact that the total number 
of sources included in a survey, n, follows a binomial distribution. For example, suppose one performed a survey over 
one quarter of the sky with no flux limit. Assuming that sources are uniformly distributed on the sky, the probability 
of including a source for this survey is simply 1/4. If there are A total sources in the universe, the total number 
of sources that one would find within the survey area follows a binomial distribution with N 'trials' and probability 
of 'success' p — 1/4. However, the Poisson likelihood is derived by noting that the number of sources detected in 
some small bin in (L, z) follows a Poisson distribution. Since the sum of a set of Poisson distributed random variables 
also follows a Poisson distribution, this implies that the total number of sources detected in one's survey, n, follows 
a Poisson distribution. However, n actually follows a binomial distribution, and thus the observed data likelihood 
function is not given by the Poisson distribution. The source of this error is largely the result of approximating the 
number of sources in a bin as following a Poisson distribution, when in reality it follows a binomial distribution. 

Although the Poisson likelihood function for the LF is incorrect, the previous discussion should not be taken as a 
claim that previous work based on the Poisson likelihood function is incorrect. When the number of sources included 
in one's sample is much smaller than the total number of sources in the universe, the binomial distribution is well 
approximated by the Poisson distribution. Therefore, if the survey only covers a small fraction of the sky, or if the flux 
limit is shallow enough such that n <C A, then the Poisson likelihood function should provide an accurate approximation 
to the true binomial likelihood function. When this is true, statistical inference based on the Poisson likelihood should 
only exhibit negligible error, so long as there are enough sources in one's survey to obtain an accurate estimate of 
the LF normalization. In § 13.31 we use simulate to compare results obtained from the two likelihood functions, and to 
compare the maximum-likelihood approach to the Bayesian approach. 

3. POSTERIOR DISTRIBUTION FOR THE LF PARAMETERS 

We can combine the likelihood function for the LF with a prior probability distribution on the LF parameters 
to perform Bayesian inference on the LF. The result is the posterior probability distribution of the LF parameters, 
i.e., the probability distribution of the LF parameters given our observed data. This is in contrast to the maximum 
likelihood approach, where the maximum likelihood approach seeks to relate the observed value of the MLE to the 
true parameter value through an estimate of the sampling distribution of the MLE. In Appendix §[A]we give a more 
thorough introduction to the difference between the maximum likelihood and Bayesian approaches. 

3.1. Derivation of the Posterior Probability Distribution 

The posterior probability distribution of the model parameters is related to the likelihood function and the prior 
probability distribution as 

p(6,N\L oba ,z obs ,I)(xp(6,N)p(L obs ,z obs ,I\6,N), (11) 

where p(0,N) is the prior on (6,N), and p(L obs , z obs , I\0, A) is is the observed data likelihood function, given by 
Equation (|6|). The posterior distribution is the probability distribution of and A, given the observed data, L obs and 
z obs . Because the luminosity function depends on the parameters 9 and A, the posterior distribution of 9 and A can 
be used to obtain the probability distribution of <f>(L, z), given our observed set of luminosities and redshifts. 

It is of use to decompose the posterior as p(0, N\L obs , z obs ) oc p(N\6, L obs , z obs )p{0\L obs , z obs ); here we have dropped 
the explicit conditioning on I. This decomposition separates the posterior into the conditional posterior of the LF 
normalization at a given 9 7 p(N\L obs , z obsi 0), from the marginal posterior of the LF shape, p(0\L obs , z obs ). In this 
work we assume that A and 9 are independent in their prior distribution, p(9,N) — p(N)p(0), and that the prior 
on A is uniform over log A. A uniform prior on log TV corresponds to a prior distribution on A of p(N) oc 1/iV, as 
p(\ogN)d\ogN = p(N)dN. Under this prior, one can show that the marginal posterior probability distribution of is 

p{d\L obs ,z obs )<xp{e)W = i\QT n II pfo.*!*). ( 12 ) 

i£A ob3 

where p(I = 1\0) = 1 — p(I = O\0). We derive Equation ([T2j) in Appendix § [B] (see also lGelman et al.ll200l . Under 
the assumption of a uniform prior on 0, Equation (fT2|) is equivalent to Equation (22) in lFan et alJ (|2001f ). who use a 
different derivation to arrive at a similar result. 

Under the prior p(logJV) oc 1, the conditional posterior distribution of A at a given is a negative binomial 
distribution with parameters n and p(I = l\0). The negative binomial distribution gives the probability that the total 
number of sources in the universe is equal to A, given that we have observed n sources in our sample with probability 
of inclusion p(I — 1\0): 

p(N\n, 0) = C^ 1 W = l\0)T W = mf~ n ■ (13) 

Here, p(I = O\0) is given by Equation (O and p(I = 1\0) = 1 —p(I = O\0). Further description of the negative binomial 
distribution is given in § [C] The complete joint posterior distribution of and A is then the product of Equations 
(HI and (ED, P (0, N\L obs ,z obs ) oc p(N\0, n)p(0\L obs , z obs ). 

Because it is common to fit a luminosity function with a large number of parameters, it is computationally intractable 
to directly calculate the posterior distribution from Equations (j!2p and (|13p . In particular, the number of grid points 
needed to calculate the posterior will scale exponentially with the number of parameters. Similarly, the number of 
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integrals needed to calculate the marginal posterior probability distribution of a single parameters will also increase 
exponentially with the number of parameters. Instead, Bayesian inference is most easily performed by simulating 
random draws of N and 9 from their posterior probability distribution. Based on the decomposition p{9, N\L b s , z a b s ) oc 
p(N\n, 9)p(0\L o bs, z bs), we can obtain random draws of (6,N) from the posterior by first drawing values of 9 from 
Equation (JT2J). Then, for each draw of 9, we draw a value of N from the negative binomial distribution. The values 
of N and 9 can then be used to compute the values of luminosity function via Equation {1} . The values of the LF 
computed from the random draws of N and 9 are then treated as a random draw from the probability distribution 
of the LF, given the observed data. These random draws can be used to estimate posterior means and variances, 
confidence intervals, and histogram estimates of the marginal distributions. Random draws for may be obtained via 
Markov chain Monte Carlo (MCMC) methods, described in §[51 and we describe in § [Clhow to obtain random draws 
from the negative binomial distribution. In § 16.21 we give more details on using random draws from the posterior to 
perform statistical inference on the LF. 

3.2. Illustration of the Bayesian Approach: Schechter Function 

Before moving to more advanced models, we illustrate the Bayesian approach by applying it to a simulated set 
of luminosities drawn from a Schechter function. We do this to give an example of how to calculate the posterior 
distribution, how to obtain random draws from the posterior and use these random draws to draw scientific conclusions 
based on the data, and to compare the Bayesian approach with the maximum- likelihood approach (see § I3.3[) . The 
Schechter luminosity function is: 

(p(L) = ( ^] e~ L l L \ 9=(a,L*). (14) 
v ; L*T(a + 1) \L* J ' v ' 1 y ' 

For simplicity, we ignore a z dependence. The Schechter function is equivalent to a Gamma distribution with shape 
parameter k = a + 1, and scale parameter L*. Note that k > and a > — 1; otherwise the integral of Equation 
(|14[) may be negative or become infinite. For our simulation, we randomly draw N = 1000 galaxy luminosities from 
Equation (|14|) using a value of a = and L* = 10 44 erg s _1 . 

To illustrate how the results depend on the detection limit, we placed two different detection limits on our simulated 
survey. The first limit was at L m in = 2 x 10 43 ergs s _1 , and the second was at L m in = 2 x 10 44 ergs s _1 . We used a 
hard detection limit, where all sources above L min were detected and all sources below L min were not: p(I — 1\L > 
Lmin) — 1 and p(I = l\L < L min ) = 0. Note that the first detection limit lies below L* , while the second detection 
limit lies above L*. We were able to detect n ~ 818 sources for L m i n = 2 x 10 43 ergs s _1 and n ~ 135 sources for 
L m in = 2 x 10 44 ergs s" 1 

The marginal posterior distribution of a and L* can be calculated by inserting into Equation (|12p an assumed prior 
probability distribution, p(a,L*), and the likelihood function, p(Li\a, L*). Because we are ignoring redshift in our 
example, the likelihood function is simply p(Li\a, L*) = cj>(L)/N. In this example, we assume a uniform prior on 
logL* and a, and therefore p(L*, a) oc 1/L*. From Equations (fT2"]) and (fT4")) , the marginal posterior distribution of the 
parameters is 

1 ™ 1 / T ■ \ a 

P (a,L*\L obs ) oc - = IK £*)]- ft L . T(a+ 1} f J7 ) . (15) 

where the survey detection probability is 

p(I = l\a,L*)= [°° —J——(^L) e- L ^ L " dL. (16) 



Lmin L*T(a + 1) \£ 

The conditional posterior distribution of TV at a given 9 is given by inserting in Equation (|16p into Equation (|13p , and 
the joint posterior of a, L* , and N is obtained by multiplying Equation (|15p by Equation (|13p . 

We perform statistical inference on the LF by obtaining random draws from the posterior distribution. In order 
to calculate the marginal posterior distributions, p(a\L b s ),p(L*\L i ls ), and p(N\L b s ), we would need to numerically 
integrate the posterior distribution over the other two parameters. For example, in order to calculate the marginal 
posterior of a, p{a\L b s , z b s ), we would need to integrate p(a, L* , N\L i, s ) over L* and JV on a grid of values for a. 
While feasible for the simple 3-dimensional problem illustrated here, it is faster to simply obtain a random draw of 
a, L* , and N from the posterior, and then use a histogram to estimate p(a\L b s ). Further details are given in ^ 16.21 on 
performing Bayesian inference using random draws from the posterior. 

We used the Metropolis-Hastings algorithm described in § 15.11 to obtain a random draw of a, L*, and N from the 
posterior probability distribution. The result was a set of 10 5 random draws from the posterior probability distribution 
of a,L*, and N. In Figure [T] we show the estimated posterior distribution of a,L*, and N for both detection limits. 
While L* is fairly well constrained for both detection limits, the uncertainties on a and N are highly sensitive to whether 
the detection limit lies above or below L* . In addition, the uncertainties on these parameters are not Gaussian, as is 
often assumed for the MLE. 

The random draws of a,L*, and iV can also be used to place constraints on the LF. This is done by computing 
Equation (fT4|) for each of the random draws of a,L*, and N, and plotting the regions that contain, say, 90% of the 
probability. In Figure [5] we show the posterior median estimate of the LF, as well as the region containing 90% of the 
posterior probability. As can be seen, the 90% bounds contain the true value of the LF, and increase or decrease to 
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Fig. 1. — Post erior probability distribution of the Schechter luminosity function parameters, N,a, and L* for the simulated sample 
described in § 13.21 The top three panels show the posterior when the luminosity limit of the survey is L > 2 X 10 [erg s — 1 ], and the bottom 
three panels show the posterior distribution when the luminosity limit of the survey is L > 2 X 10 44 [erg s -1 ]. The vertical lines mark the 
true values of the parameters, N = 1000, a = 0, and L* = 10 44 [erg s -1 ]. The uncertainty on the parameters increases considerably when 
L m in > L* , reflecting the fact that the bright end of the Schechter LF contains little information on a or N. 
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Fig. 2. — True value of the Schechter luminosity function (dashed line), compared with the best fit luminosity function cal cula ted from 
the posterior median of the Schechter function parameters, N, a, and L* (solid line), from the simulated sample described in § 13.21 The left 
panel summarizes the posterior probability distributin of the LF when the luminosity limit is L > 2 X 10 43 [erg s — 1 ], and the right panel 
summarizes the posterior distribution of the LF when the luminosity limit is L > 2 X 10 44 [erg s — 1 ]. In both panels the shaded region 
contains 90% of the posterior probability, and the vertical line marks the lower luminosity limit of the simulated survey. The uncertainty 
on the LF below the luminosity limit increases considerably when L m i n > L* , reflecting the fact that the bright end of the Schechter LF 
contains little information on a or N, and therefore contains little information on the faint end of the LF. 



reflect the amount of data available as a function of L. Furthermore, unlike the traditional MLE, these bounds do not 
rely on an assumption of Gaussian uncertainties, and therefore the confidence regions are valid for any sample size. 

3.3. Comparison with Maximum-likelihood: Schechter Function 

We also use Monte Carlo simulation to compare the Bayesian approach to maximum-likelihood for both the binomial 
and Poisson likelihood functions. We simulated 20 data sets for four types of surveys: (1) A large area shallow survey, 
(2) a large area medium d epth survey, (3) a small area deep survey, and (4) a large area deep survey for rare objects, 
such as z ~ 6 quasars (e.g-. lFan et al.ll2006ft . For all four survey types we simulated quasars from a Schechter luminosity 
function with parameters the same as in § 13.21 For the large area shallow survey we used a total number of sources of 



Kelly et al. 



TABLE 1 

schechteh function confidence intervals: maximum-llkelihood vs. bayesian 

Inference 





Large Area, Shallow Large Area, Medium Small Area, Deep 


Rare Object 




Normalization, N 


Poisson 


19 


16 


10 


11 


Binomial 


19 


19 


12 


10 


Bayesian 


20 


19 


19 


20 


Faint End Power-law Slope, k 


Poisson 


18 


17 


18 


17 


Binomial 


18 


19 


18 


IS 


Bayesian 


20 


20 


18 


19 


Bright End Exponential Cut-off, L* 


Poisson 


20 


19 


19 


18 


Binomial 


20 


20 


19 


18 


Bayesian 


20 


20 


18 


20 



Note. — Table^gives the number of times the true value of each parameter was contained within 
the estimated 95% confidence interval for the simulated data sets described in § 13.31 The results 
are reported sepcrately for each type of survey and Schcchtcr function parameter. We simulated 20 
data sets for each type of survey. 



N = 10 5 , an area of f2 = 10 4 deg 2 , and a lower luminosity limit of L min = 5 x 10 44 erg s 1 . Only n ~ 160 sources are 
expected to be detected by this survey. For the large area medium depth survey we also used a LF normalization of 
N — 10 5 and area of = 10 4 deg 2 , but instead used a lower luminosity limit of L m i n — 5 x 10 43 erg s _1 . The large area 
medium depth survey is expected to detect n ~ 1.5 x 10 4 sources. For the small area deep survey we used a survey area 
of Q = 448 arcmin 2 , a LF normalization of N = 5 x 10 7 sources, and a lower luminosity limit of L m i n = 10 43 erg s — 1 . 
This survey is expected to detect n ~ 140 sources. Finally, for the large area deep rare object survey we used an area 
of — 10 4 deg 2 , a LF normalization of N = 75 sources, and a lower luminosity limit of L m ; m = 10 43 erg s _1 . Only 
n ~ 16 sources are expected to be detected by the rare object survey. 

We fit each of the 20 simulated data sets by maximum-likelihood for both the binomial and Poisson likelihood 
functions. The 95% confidence intervals on the best-fit parameters were determined using 2000 bootstrap samples. 
We generate each bootstrap sample by randomly drawing n data points with replacement from the original sample, 
and then calculate the maximum-likelihood estimates for each bootstrap sample. We use the bootstrap to estimate 
the confidence intervals because the bootstrap does not assume that the errors are Gaussian, and because it is a 
common technique used in the LF literature. We estimate the 95% confidence intervals directly from the 0.025 and 
0.975 percentil es of the bootstrap sample. While bo otstrap confidence intervals derived in this manner are known to 
be biased (e.g.,[Efro3[i983;[D avison fc Hin klev 1997), additional corrections to the bootstrap samples are complicated. 
In addition, it is common practice to estimate bootstrap confidence intervals in this manner, and it is worth testing 
their accuracy. For the Bayesian approach, we used the MHA algorithm described in 15 . II to simulate 5 x 10 4 random 
draws from the posterior distribution. The MHA algorithm was faster than fitting the 2000 bootstrap samples using 
maximum likelihood. 

For each of the simulated samples, we counted the number of times that the true values of N, a, and L* were 
contained within the estimated 95% confidence interval. The results are summarized in Table [TJ Because we estimated 
values of three parameters for 20 simulated data sets of 4 different types of surveys, we had 240 'trials' with probability 
of 'success' p — 0.95. If the estimated 95% confidence regions corresponded to the true regions, then with w 99% 
probability between 220 and 236 of the 'trials' would fall within the estimated confidence region. For the binomial 
likelihood, the true value of a parameter was within the estimated 95% confidence region only 210 times (88%), and 
for the Poisson likelihood the true value of a parameter was within the estimated 95% confidence region only 202 times 
(84%). In contrast, the Bayesian approach was able to correctly constrain the true value of a parameter to be within the 
95% confidence region 233 times (97%). Therefore, for our simulations confidence regions derived from bootstrapping 
the maximum-likelihood estimate are too narrow, while the confidence regions derived from the Bayesian method are 
correct. 

Most of the failure in the maximum- likelihood confidence intervals came from the difficulty of the maximum- likelihood 
approach in constraining the LF normalization, N, for the small area deep survey and for the rare object survey. In 
particular, for these two surveys the bootstrap 95% confidence intervals for both the binomial and Poisson likelihood 
function only contained the true value of N roughly 50% of the time. In general, there wasn't a significant difference 
among the simulated surveys in the ability of the three different statistical approaches to constrain k and L* at 95% 
confidence. However, the Poisson and binomial likelihood functions gave slightly different results for the larger area 
medium depth survey. For this survey the 95% confidence intervals for the maximum-likelihood estimate derived from 
the Poisson distribution were somewhat smaller than those for the binomial distribution, only correcting including the 
true values of N, k, and L* roughly 85% of the time. This is expected, because the Poisson distribution is the limit of 
the binomial distribution as the probability of including a source approaches zero; however, the detection probability 



Estimating Luminosity Functions 



9 



for the large area medium depth survey is ss 0.15. 

The results of our simulations imply that the Poisson likelihood function may lead to biased estimates of confidence 
intervals on the luminosity function parameters, so long as an astronomical survey is able to detect a significant 
fraction of the objects of interest. Use of the Poisson likelihood function may thus be problematic for the large 
astronomical surveys becoming common, so long as they are deep enough to achieve a moderate detection fraction. 
For the simulations performed here, using the Poisson likelihood function resulted in confidence intervals that are too 
narrow. However, these simulations were for a Schechter luminosity function, and other parameterizations may be 
affected differently. Considering that there are no obvious computational advantages to using the Poisson likelihood 
function, we recommend that the correct binomial likelihood function be used, as it is the correct form. 



4. MIXTURE OF GAUSSIAN FUNCTIONS MODEL FOR THE LUMINOSITY FUNCTION 

In this section we describe a mixture of Gaussian functions model for the luminosity function. The mixture of 
Gaussians model is a common and well studied 'non-parametric' model that allows flexibility when estimating a 
distribution, and is often employed when there is uncertainty regarding the specific functional form of the distribution 
of interest. The basic idea is that one can use a suitably large enough number of Gaussian functions to accurately 
approximate the true LF, even though the individual Gaussians have no physical meaning. In this sense, the Gaussian 
functions serve as a basis set of the luminosity function. As a result, we avoid the assumption of a more restrictive 
parametric form, such as a power-law, which can introduce considerable bias when extrapolating beyond the bounds 
of the observable data. We have not experimented with other luminosity function basis sets, although mixture models 
are very flexible and need not be limited to Gaussian functions. 

In this work we assume the mixture of Gaussian functions for the joint distribution of log L and log z, as the logarithm 
of a strictly positive variable tends to more closely follow a normal distribution than does the untransformed variable. 
Therefore, we expect that a fewer number of Gaussians will be needed to accurately approximate the true LF, thus 
reducing the number of free parameters. Assuming a mixture of Gaussian functions for the joint distribution of log L 
and logz is equivalent to assuming a mixture of log-normal distributions for the distribution of L and z. The mixture 
of K Gaussian functions model for the i th data point is 



p(\og Li, log Zi\ir,n,Y<) 



TTfc 



K 



2ir St. 
fc=i 1 K 



1/2 



exp 



-(x,; - Mfc) T S fe X (xi - fi k ) 



(7T,/X,S), 



(17) 



where 5Zk=i n k — 1- Here, = (log Li, log Zi), /i^ is the 2-element mean (i.e., position) vector for the k th Gaussian, 
Sfc is the 2x2 covariance matrix for the fc th Gaussian, and x T denotes the transpose of x. In addition, we denote 
7r = (wi, . . . ,ttk), A* — (Ml) • • • ! P-k), and S = (Si, . . . , S#). The variance in logL for Gaussian k is of k = Sn^, 
the variance in logz for Gaussian k is a z k — S22,fc, and the covariance between logL and logz for Gaussian k is 
ai z ^k = Si2,fc. In this work we consider the number of Gaussian components, K , to be specified by the researcher and 
fixed. If the number of Gaussian functi ons is also considered to be a free parameter, then methods exist for performing 
Bayesian inference on K as well (e.g., iRichardson fc Greenlll997T ). Statistical inference on the luminosity functions 
studied in this work were not sensitive to the choice of K so long as K > 3, and we conclude that values of K > 3 
should be sufficient for most smooth and unimodal luminosity functions. 

Under the mixture model, the LF can be calculated from Equations (fT]) and (fl7|) . Noting that p{L,z) = 
p(logL,logz)/(L^(lnlO) 2 ), the mixture of Gaussian functions model for the LF is 



<t>(L,z\6,N) 



N 



Lz(lnlO) 2 



-Yt 

dz I ^ 2?r S fe 

7 fc=i 1 K 



1/2 



exp 



-i(x-^ fe ) T S fe 1 (x-/i A; ; 



(18) 



where, as before, x = (logL, logz). A mixture of Gaussian functions models was also used bv lBlanton et all (|2003l ) to 
estimate the z = 0.1 gal axy LF from th e Sloa n Digital Sky Survey (SDSS). Our mixture of Gaussian functions model 
differs from that used bv lBlanton et al.l (j2003) in that we do not fix the Gaussian function centroids to lie on a grid of 
values, and their individual widths are allowed to vary. This flexibility enables us to use a smaller number of Gaussian 
functions (typically ~ 3 — 6) to accurately fit the LF. 

4.1. Prior Distribution 

In this section we describe the prior distribution that we adopt on the mixture of Gaussian functions parameters. 
While one may be tempted to assumed a uniform prior on ir, \i, and S, this will lead t o an improper posterior, i.e., 
the posterior probability density does not integrate to one (jRoeder fc Wassermanlll997l). Therefore, a uniform prio r 
cannot be used, and we need to develop a more informative prior distribution. Following lRoeder fc W asscrmanl (|1997h . 

we assume a uniform prior on tti, . . . , ttk under the constraint that X^fc=i 77 k = 1; formally, this is a Dirichlet(l, . . . , 1) 
prior, where Dirichlct(ai, . . . , ape ) denotes a Dirichlct density with parameters ot\, . . . , an- We give further details on 
the Dirichlet probability distribution in Appendix § [Cj 

Although our prior knowledge of the LF is limited, it is reasonable to assume a priori that the LF should be 
unimodal, i.e., that the LF should not exhibit multiple peaks. Currently, the observed distributions of galaxies and 
AGN are consistent wi th this assumption. For galaxies, unimo dal luminosity function is implied by simple models of 
galaxy formation (e.g. JPress fc S chechter 1974; Schechter 1976), and the luminosity functions for galaxies classified by 
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Fig. 3. — An illustration of our prior distribution for the Gaussian function parameters, for K = 5 Gaussian functions. Shown are a 
case when the Gaussian functions used in modelling the luminosity function are close together with respect to their covariances (left), and 
when the Gaussian functions are far apart with respect to their covariance matrices (right). The marginal distributions of log z are shown 
above the plots, and the marginal distributions of logL are shown to the right of the plots. When the Gaussian functions are close, the 
LF is unimodal, but when the Gaussian functions are far apart, the LF is multimodal. Because our prior distribution is constructed to 
place more probability on situations when the Gaussian functions are closer together with respect to their individual covariance matrices, 
it would place more probability on the situation shown in the left plot a priori. Our prior therefore reflects our expectation that the LF 
should not exhibit multiple peaks (modes). 



morphology are not sufficiently separated to create multimodality (e.g.. iNakamura et al.ll2003t IScarlata et ai1l2007f ). 
However, the luminosity function for AGN may be bimodal due to the possible existence of two difference accretion 
modes. This is largely due to the fact that the distribution of accretion rates relativ e to the Eddington rate, m, is 
likely bimodal (|Holl2002t iMarchesini et al.ll200i [Hopkins et al.ll2006bllCao fc Xul 12001 as a result of a transition from 
a radiatively inefficient flow to an efficient one at fa ~ 0.01 (e.g.. Pesterll2005D "Because L oc M^^m, the distribution 
of luminosities for AGN may be bimodal. If a survey is unable to detect those AGN in the faint radiatively inefficient 
mode, then the assumption of unimodality is violated and the estimated AGN luminosity function only refers to those 
AGN in the bright radiatively efficient mode. 

To quantify our prior assumption that the LF is more likely to be unimodal, we construct our prior distribution 
to place more probability on situations where the individual Gaussian functions are close together in terms of their 
widths. In addition, we only specify the parametric form of the prior distribution, but allow the parameters of the 
prior distribution to vary and to be determined by the data. This allows our prior distribution to be flexible enough to 
have a minimal effect on the final results beyond conveying our prior consideration that the LF should be unimodal. 
We introduce our prior to place more probability on unimodal luminosity functions, to ensure that the posterior 
integrates to one, and to aid in convergence of the MCMC. Figure |3] illustrates the general idea that we are attempting 
to incorporate into our prior distribution. In this figure, we show a situation where the Gaussian functions are close 
together with respect to their widths, and far apart with respect to their widths. When the distances between the 
individual Gaussian functions, normalized by their covariance matrices (the measure of their 'width'), is small, the LF 
is unimodal; however, when the distances between the Gaussian functions are large with respect to to their covariance 
matrices, the LF exhibits multiple modes. We construct a prior distribution that places less probability on the latter 
situation. 

Our prior on the Gaus sian mean vectors and covariance matrices is similar to the prior described by 
iRoeder fe Wassermanl (|1997|) . but generalized to 2-dimensions. For our prior, we assume an independent multivariate 
Cauchy distribution for each of the Gaussian means, /i, with 2-dimensional mean vector [Lq and 2x2 scale matrix T . A 
Cauchy distribution is equivalent to a student's t distributions with 1 degree of freedom, and when used as a function 
in astronomy and physics it is commonly referred to as a Lorentzian; we describe the Cauchy distribution further in 
§[C]of the appendix. The scale matrix is chosen to be the harmonic mean of the Gaussian function covariance matrices: 

Hip?)' 1 - 

Qualitatively, this prior means that we consider it more likely that the centroids of the individual Gaussian functions 
should scatter about some mean vector /^o, where the width of this scatter should be comparable to the typical width 
of the individual Gaussian functions. The prior mean, fio, is left unspecified and is an additional free parameter to be 
estimated from the data. We choose a Cauchy prior because the Cauchy distribution is heavy tailed, and therefore 
does not heavily penalize the Gaussian functions for being too far apart. As a result, the Cauchy prior is considered 
to be robust compared to other choices, such as the multivariate normal distribution. 

Because we use a random walk computational technique to explore the parameter space and estimate the posterior 
distribution, we find it advantageous to impose additional constraints on the Gaussian centroids. Both \x and /iq are 
constrained to the region log Li ow < m tk < log L high and logz tow < fi z . k < logz high , where fi^ k is the mean in logL 
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for the k Gaussian, [iz.k is the mean in Xogz for the k Gaussian. These constraints are imposed to keep the Markov 
chains (see § [5]) from 'wandering' into unreasonable regions of the parameter space. The flux limit sets a lower limit 
on the luminosity of detected sources as a function of z, and therefore there is nothing in the observed data to 'tell' 
the random walk that certain values of fii are unreasonable. For example, suppose our survey is only able to detect 
quasars with L > 10 10 L Q . Because of this, there is nothing in our data, as conveyed through the likelihood function, 
that says values of, say, L ~ IOLq are unreasonable, and thus the Markov chains can get stuck wandering around 
values of fj,i ~ 1. However, we know a priori that values of \xi ~ 1 are unphysical, and therefore it is important to 
incorporate this prior knowledge into the posterior, as it is not reflected in the likelihood function. The values of these 
limits should be chosen to be physically reasonable. As an example, for the SDSS DR3 quasar LF with luminosities 
measured at ALa(2500A), it might be reasonable to take Li ow = 10 40 erg s" 1 , Lhigh — 10 48 erg s _1 , zi ow = 10~ 4 , and 

Zhigh 

Generalizing the prior of iRoeder fc Wassermanl (|1997f L we assume independent inverse Wishart priors on the indi- 
vidual Gaussian covariance matrices with v = 1 degrees of freedom, and common scale matrix A. We give a description 
of the Wishart and inverse Wishart distributions in § [Cj This prior states that the individual Efc are more likely to 
be similar rather than different. The common scale matrix, A, is left unspecified so it can adapt to the data. As 
with fx, we recommend placing upper and lower limits on the allowable values of dispersion in logL and logz for each 
Gaussian.. 

Mathematically, our prior is 

K 

p(tt,(i, E,/io, A) oc JJK/^Imo, E)p(E fc |A) (20) 
fc=i 

K 

oc JJ Cauchy 2 (^ fe |/i ,T)Inv-Wisharti(E fe |A), (21) 
fc=i 

under the constraints given above. Here, Cauchy 2 (/ifc|/^o, T) denotes a 2-dimensional Cauchy distribution as a function 
of jitfe, with mean vector p,o and scale matrix T. In addition, Inv-Wisharti(E/ c | J 4) denotes an inverse Wishart density 
as a function of E&, with one degree of freedom and scale matrix A. We have also experimented with using a uniform 
prior on the parameters, constricted to some range. In general, this did not change our constraints on the LF above the 
flux limit, but resulted in somewhat wider confidence regions on the LF below the flux limit. This is to be expected, 
since our adopted Cauchy prior tends to restrict the inferred LF to be unimodal, and therefore limits the number of 
possible luminosity functions that are considered to be consistent with the data. 

4.2. Posterior Distribution for Mixture of Gaussians Model 

Now that we have formulated the prior distribution, we can calculate the posterior distribution for the mixture of 
Gaussians model of <f>(L, z). Because we have formulated the mixture model for the LF in terms of logL and log 2, the 
marginal posterior distribution of 9 is 

p(0, fio,A\ log L obs , log z obs ) <xp(9, hq, A) \p(I= 1|0)P J[ p(log Li, log Zi\9), # = K/i,E), (22) 

ieAoba 

where p(0, fio, A) is given by Equation ([21"]) . p(\og Li, log Zi\9) is given by Equation (fl7|) . and 

/oo />oo 
/ p(I = l\\ogL, log z)p(\ogL,\ogz\9) dlogL d\ogz (23) 
-oo J —oc 

is the probability of including a source, given the model parameters 9. The conditional posterior distribution of N 
given 7r,/i, and S is given by inserting Equation (|23|) into (|13p . The complete joint posterior distribution is then 

p{9, N, no, A\ log L obs ,\ogz obs ) oc p(N\9, n)p(9, Ho, A \ \ogL obs , log z obs ). (24) 

5. USING MARKOV CHAIN MONTE CARLO TO ESTIMATE THE POSTERIOR DISTRIBUTION OF THE LUMINOSITY 

FUNCTION 

For our statistical model, fio has 2 free parameters, A has 3 free parameters, and each of the K Gaussian components 
has 6 free parameters. Because the values of 7r are constrained to sum to one, there are only 6K — 1 free parameters 
for the Gaussian mixture model. The number of free parameters in our statistical model is therefore 6K + 4. The large 
number of parameters precludes calculation of the posterior on a grid of it, /i, E, /iq, A, and N. Furthermore, the multiple 
integrals needed for marginalizing the posterior, and thus summarizing it, are numerically intractable. Because of this, 
we employ Markov Chain Monte Carlo (MCMC) to obtain a set of random draws from the posterior distribution. A 
Markov chain is a random walk, where the probability distribution of the current location only depends on the previous 
location. To obtain random numbers generated from the posterior distribution, one constructs a Markov chain that 
performs a random walk through the parameter space, where the Markov chain is constructed to eventually converge 
to the posterior distribution. Once convergence is reached, the values of the Markov chain are saved at each iteration, 
and the values of these locations can be treated as a random draw from the posterior distribution. These draws may 
then be used to estimate the posterior distribution of <f>(L, z), and thus an estimate of the LF and its uncertainty can 
be obtained. 
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Fig. 4. — Schematic diagram illustrating the random walk Metropolis-Hastings algorithm. The posterior probability distribution is 
illustrated by the contours, and the random walk is initially at the position marked with a square. A new proposed value of $i is randomly 
drawn, marked by the arrow pointing to the left. Because the proposed value of 8\ is at a location with higher posterior probability, the 
new value of 8\ is saved, and the random walk 'jumps' to the position marked by the arrow. Then, a new proposal for 82 is randomly 
drawn, marked by the arrow pointing upward. Because this proposed value of 82 is at a location with lower posterior probability, it is only 
accepted with probability equal to the ratio of the values of the posterior at the proposed position and the current position. If the proposed 
value is kept, then the new value of 82 is saved, otherwise the current value of 82 is saved. Next, a proposed value of 81 is randomly drawn, 
and the process repeats, creating a random walk through the parameter space. Because the amount of time that the random walk spends 
in any given bin in 81 and 82 is proportional to the posterior probability distribution, after the random walk has converged, the values of 
81 and 82 from the random walk may be treated as a random draw from the posterior distribution. 

In this work we use the Metropolis-Hastings algorithm (MHA, Metrop olis fc Ulaml [1949; Me tropolis et alj 119531 : 
Hastings 1970) to perform the MCMC. We describe the particular MHA we employ for Bayesian inference on the LF; 
however for a more g eneral and complete description of the MHA, we refer the reader to IChib fc Greenbergl (|1995| ) or 
IGelman et all pOOl . We use the MHA to obtain a set of random draws from the marginal posterior distribution of 
9, given by Equation (|22[) . Then, given these random draws of 9, random draws for N may be obtained directly from 
the negative binomial distribution (see Eq. [13] and § [C] of the appendix) . 

The basic idea behind the MHA is illustrated in Figure [4] for the special case of a symmetric jumping distribution. 
First, one starts with an initial guess for 9. Then, at each iteration a proposed value of 9 is randomly drawn from some 
'jumping' distribution. For example, this jumping distribution could be a normal density with some fixed covariance 
matrix, centered at the current value of 9. Then, if the proposed value of 9 improves the posterior, it is stored as the 
new value of 9. Otherwise, it is stored as the new value with probability equal to the ratio of the values of the posterior 
distribution at the proposed and current value of 9. If the proposed value of 9 is rejected, then the value of 9 does not 
change, and the current value of 9 is stored as the 'new' value of 9. The process is repeated until convergence. If the 
jumping distribution is not symmetric, then a correction needs to be made to the acceptance rule in order to account 
for asymmetry in the jumping distribution. A jumping distribution is symmetric when the probability of jumping 
from a current value 9 to a new value 9* is the same as jumping from 9* to 9. For example, the normal distribution 
is symmetric, while the log-normal distribution is not. 

5.1. Metropolis- Hastings Algorithm for Schechter Luminosity Function 

Before describing our MHA algorithm for the mixture of Gaussian functions model, we describe a simpler MHA 
algorithm for the Schechter function model given by Equation (fT4")l in § 13.21 We do this to illustrate the MHA using a 
more familiar luminosity function. An MHA for obtaining random draws of a, L* , and N from Equation (I15p is: 

1. Start with an initial value of a and L* , denoted as a and L*. A good initial value is the maximum- likelihood 
estimate. 
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2. Draw a proposal value of logL* from a normal distribution centered on the current value of logL*, logL* . The 
variance in the jumping distribution of logL* should be fixed at the beginning of the MHA. A larger jumping 
variance will lead to jumps that travel greater distances, but will then lead to lower MHA acceptance rates. 
The value of the jumping variance should be tuned to give acceptance rates ~ 0.4. We use a normal jumping 
distribution to vary logL* because logL* is defined on (— oo, oo), while L* is only defined on (0, oo). While we 
could use a jumping distribution to directly vary L* , it is not always easy to simulate random variables directly 
from distributions that are only defined for L* > 0. 

Denoting the proposal value of L* as L *, calculate the ratio 

L*p(a,L*\L obs ) 

ri* = = (25) 

L*p(a,L*\L obs ) 

Here, p(a, L*\L obs ) is the posterior distribution for the Schechter function, given by Equation Q15j). If r^* > 1, 
then keep the proposal and set L* = L*. If tl* < 1, then draw a random number u uniformly distributed 
between and 1. If u < n>, then keep the proposal and set L* = L*. Otherwise, the proposal is rejected 
and the value of L* is unchanged. The factor of L/L* is necessary in Equation (j25|) in order to correct for the 
asymmetry in the log-normal jumping distribution. 

3. Draw a proposal value of logfc = log(a + 1) from a normal distribution centered at the current value of logfc, 
log k — log(a + 1). Similar to the MHA step for L*, we use a normal jumping distribution to vary log k because 
logfc is defined on (— oo, oo), while a is only defined on (—1, oo). 

Denoting the proposal value of fc as k, the proposal value of a is a = k — 1. Using the values of a and a, calculate 
the ratio 

kp(a,L*\L obs ) 

r a = - = (26) 

kp(a,L*\L obs ) 

If r a > 1, then keep the proposal and set a = a. If r a < 1, then draw a random number u uniformly distributed 
between and 1. If u < r a , then keep the proposal and set a — a. Otherwise, the proposal is rejected and the 
value of a is unchanged. As with the MHA step for L*, the factor of fc/fc is necessary in Equation (|26|) in order 
to correct for the asymmetry in the log-normal jumping distribution. 

4. Repeat steps j2])-([3]) until the MHA algorithm converges. Techniques for monitoring convergence are described 
in iGelman et al.l (|2004[ K After convergence, use Equation (|13| to directly simulate random draws of the LF 
normalization, N, for each simulated value of a and L* obtained from the above random walk. Equation (I13|) 
has the form of a negative binomial distribution, and a method for simulated random variables from the negative 
binomial distribution is described in § [C] of the appendix. 

5.2. Metropolis-Hastings Algorithm for the Mixture of Gaussian Functions Luminosity Function 

Our MHA for the mixture of Gaussian functions model is a more complex version of that used for the Schechter 
function model. As before, we denote the current value of a parameter by placing a "over its symbol, and we denote 
the proposed value by placing a * over its symbol. For example, if one were updating tt, then tt denotes the current 
value of tt in the random walk, and f} denotes the proposed value of tt. We will only update one parameter at a time, 
so, if we are drawing a proposal for tt, the current value of 9 is denoted as 9 — (tt, fx, E), and the proposed value of 9 
is denoted as 9 — (tt, fx, E). 

Our MHA for the mixture of Gaussian functions model is: 

1. Start with initial guesses for tt, lx, S, lxq, A, and T. 

2. Draw a proposal value for tt from a Dirichlet (c/i, . . . ,()k) density, where gu = c^titt^ + 1, n is the number of 
sources in the survey, and c T is a fixed positive constant that controls how far the 'jumps' in tt go. Because c v 
controls the variance of the Dirichlct density, a smaller value of produces values of tt that are further from jr. 
The value of c T should be chosen so that about 15-40% of the MHA proposals are accepted. 

After drawing a proposal for tt, calculate the value of the posterior distribution at the new value of 9, 9, and at 
the old value of 9, 9. Then, use these values to calculate the ratio 

^ _ Dirichlct (tt \g) p(9\L obs , z obs ) 

Dirichlet(7r|<?) p(6\L obs , z ohs ) ' 

where g = c^uttx + 1, . . . , c^nrtK + 1- The ratio of Dirichlet densities in Equation (|2"T)) corrects the MHA 
acceptance rule for the asymmetry in the Dirichlct jumping distribution. If r„ > 1 then keep the proposed value 
of tt: tt = tt. Otherwise keep the proposal with probability r ff . This is done by drawing a uniformly distributed 
random variable between and 1, denoted by u. If u < r„, then set tt = if. If u > r n then keep the current value 

of TT. 

Methods for simulating from the Dirichlet distribution, as well as the functional form of the Dirichlct distribution, 
are given in §[Cl 
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For each Gaussian function, draw a proposal for p k by drawing fa ~ N^ifiki Vk), where Vk is some set covariance 
matrix. Because the jumping density is symmetric, the MHA acceptance ratio is just given by the ratio of the 
posterior distributions at the proposed and current value of p k - = p(9\L b s , z b s )/p(9\L b s , z b s ). If > 1 
then set fa = fai otherwise set fa = fa with probability r M . The MHA update should be performed separately 
for each Gaussian. The covariance matrix of the jumping kernel, Vk, should be chosen such that ~ 30% of the 
MHA jumps are accepted. 

Since we have constructed the prior distribution with the constraint log Li ow < fi^i- < log Lhigh and log z m in < 
Mz,fc < log Zhigh for all k, any values of fa that fall outside of this range should automatically be rejected. 

For each Gaussian, draw a proposal for E& by drawing E& ~ Wish-art^ (£&/!/&), where Vk is some set degrees of 
freedom. Larger values of v\. will produce values of E& that are more similar to £fe. The MHA acceptance ratio 
is 

/ - \ y fc -3/2 

r £ = ( 77—: I exp <| — —tr (E fc J E fe - E fc E fc j — -, (28) 



2 



p(9\x obs ) 



where tr(-) denotes the trace of a matrix. If > 1 then set E& = otherwise set E& = E& with probability 
rs. The MHA update should be performed separately for each Gaussian. The degrees of freedom of the jumping 
kernel, Vk, should be chosen such that ~ 15-40% of the MHA jumps are accepted. 

If there are any bounds on Ej; incorporated into the prior distribution, then values of E& that fall outside of this 
range should automatically be rejected. Methods for simulating from the Wishart distribution, as well as the 
functional form of the Wishart distribution, are given in § [Cl 

Draw a proposal for the prior parameter p Q as (1q ~ N 2 (po, Vo). The acceptance ratio only depends on the prior 
distribution and is 



ro 



tt Cauchy 2 (/x fc |/j ,r) 
fe=i Cauchy 2 (^ fe |/io,T) 



C^T CJlT Cauchy 2 (^|Ao,T) dp k ' ' 



(29) 



Here, T is given by Equation (fH?)) and the integrals are needed because of the prior constraints on /i. If ro > 1 
then set [io — /to, otherwise set flo = /to with probability ro. We have found a good choice for Vq to be the 
sample covariance matrix of /2. 

6. Finally, update the value of A, the common scale matrix. Because we can approximately calculate the conditional 
distribution of A, given E, we can directly simulate from p(A\Y.). Directly simulating from the conditional 

distributions is referred to as a Gibbs sampler. We perform a Gibbs update to draw a new value of A: 

-Wishart^S') (30) 

(31) 




(32) 

For the Gibbs sampler update, we do not need to calculate an acceptance ratio, and every value of A is accepted: 
A = A. If there are any prior bounds set on E, then this is technically only an approximate Gibbs update, as 
it ignores the constraint on E. A true MHA update would account for the constraint on E by renormalizing 
the conditional distribution appropriately; however, this involves a triple integral that is expensive to compute. 
If there are prior bounds on E, then Equation ((30]) is approximately correct, and ignoring the normalization in 
p(j4|E) does not did not have any effect on our results. 

Steps[2HS]are repeated until the MCMC converges, where one saves the values of 9 at each iteration. After convergence, 
the MCMC is stopped, and the values of 9 may be treated as a random draw from the marginal posterior distribution of 
9, p(fl |log L bs Tog z bs). Techniques for monitoring convergence of the Markov Chains are described in lGelman et al.l 
(|2004h . If one wishes to assume a uniform prior on p, and E, constrained within some set range, instead of the prior we 
suggest in § 14.11 then only steps HHH need to be performed. Given the values of 9 obtained from the MCMC, one can 
then draw values of N from the negative binomial density (cf. Eq.|13|). In § [C] we describe how to simulate random 
variables from a negative binomial distribution. The speed of our MHA algorithm depends on the sample size and the 
programming language. As a rough guide, on a modern computer our MHA can take a couple of hours to converge 
for sample sizes of ~ 1000, and our MHA can take as long as a day or two to converge for sample sizes ~ 10 4 . 

When performing the MCMC it is necessary to perform a 'burn-in' stage, after which the Markov chains have 
approximately converged to the posterior distribution. The values of 9 from the MCMC during the burn-in stage 
are discarded, and thus only the values of 9 obtained after the burn-in stage are used in the analysis. We have 
found it useful to perform ~ 10 4 iterations of burn-in, although this probably represents a conservative number. In 
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addition, the parameters for the MHA jumping distributions should be tuned during the burn-in stage. In particular, 
the parameters S a , cr;L , c n , S^fc, and Vk should be varied within the burn-in stage to make the MHA more efficient 
and have an acceptance rate of ~ 0.15-0.4 (iGelman. Roberts, fc Gilks 1995V These jumping distribution parameters 
cannot be changed after the burn-in stage. iJasra et alj (|2005t ) and lNeall (|1996T ) described additional complications and 
considerations developing MHAs for mixture models. 

Some post processing of the Markov chains is necessary. This is because some chains can get 'stuck' wandering 
in regions far below flux limit, likely in the presence of a local maximum in the posterior. While such chains will 
eventually converge and mix with the other chains, they do not always do so within the finite number of iterations 
used when running the random walk MHA. We argued in § 14.11 that the Gaussian centroids should be limited to 
some specified range in L and z to prevent the chains from getting stuck. However, this is only a partial fix, as 
the minimum luminosity for the Gaussian function means may be significantly fainter than the flux limit at a given 
redshift, Lu m {z);i.e., jii > \ogLi ow and Li ow < Lu m (z). In general, we have found that divergent chains are easy to 
spot. Because the divergent chains usually get stuck in regions far below the flux limit, they correspond to luminosity 
functions with implied extremely low detection probabilities, i.e., p(I = 1|0) -C 1. As a result, the random draws of N 
from the posterior for these chains tend to have values that are too high and far removed from the rest of the posterior 
distribution of N. The divergent chains are therefore easily found and removed by inspecting a histogram of log N. In 
fact, we have found that the divergent chains often become too large for the long integer format used in our computer 
routines, and therefore are returned as negative numbers. Because negative values of N are unphysical, it is easy to 
simply remove such chains from the analysis. 

Having obtained random draws of N and 8 from p(9, N\ logL {, s , log z b s ), one can then use these values to calculate 
an estimate of 4>(L,z), and its corresponding uncertainty. This is done by inserting the MCMC values of 8 and N 
directly into Equation (Tl8|) . The posterior distribution of (j>(L, z) can be estimated for any value of L and z by plotting 
a histogram of the values of (j>(L, z) obtained from the MCMC values of 8 and N. In § [6l we illustrate in more detail 
how to use the MHA results to perform statistical inference on the LF. 

6. APPLICATION TO SIMULATED DATA 

As an illustration of the effectiveness of our method, we applied it to a simulated data set. We construct a simulated 
sample, and then recover the luminosity function based on our mixture of Gauss ian functions model. We assume the 
effective survey area and selection function reported for the DR3 quasar sample (Richard s et al.ll2006l ). 

6.1. Construction of the Simulated Sample 

We first drew a random value of Nq quasars from a binomial distribution with probability of success Q/Att = 0.0393 
and number of trials N = 3 x 10 5 . Here, f2 = 1622 deg is the effective sky area for our simulated survey, and we chose 
the total number of quasars to be N = 3 x 10 5 in order to ultimately produce a value of n ~ 1300 observed sources, 
after accounting for the SDSS selection function. This first step of drawing from a binomial distribution simulates a 
subset of Nq ~ 1.2 x 10 4 sources from N total sources randomly falling within an area f2 on the sky. For simplicity, in 
this simulation we ignore the effect of obscuration on the observed quasar population. While our choice of N = 3 x 10 5 
produces a mu ch smaller sample th an the actual sample of n ~ 1.5 x 10 4 quasars from the SDSS DR3 luminosity 
function work (|Richards et al. 2006), we chose to work with this smaller sample to illustrate the effectiveness of our 
method on more moderate sample sizes. 

For each of these Nq ~ 1.2 x 10 4 sources, we simulated values of L and z. We first simulated values of logz from a 
marginal distribution of the form 

4r(q + b) exp(aC) 
T(a)T(b) (i + exp ( C *)) c 



/(logs) = w^vFm — ,>. M a+6 - ( 33 ) 



where = 4(log2 — 0.4). The parameters a = 1.25 and b = 2.5 were chosen to give an observed redshift distribution 
similar to that seen for SDSS DR3 quasars (e.g.. lRichards et aT1l2006h . Values of log z are easily drawn from Equation 
(|3"3")) by first drawing x* ~ Beta(a,6), and then setting logz = logit(x*)/4 + 0.4; here, Beta(a, b) is a beta probability 
density, and logit(a;) = ln.(x/(l — x)) is the logit function. 

For each simulated value of z, we simulated a value of L using a similar functional form. The conditional distribution 
of log L given z is 

I[logLlz) r(a(2))rcg(z))r 1 + (i/L . (ig))1/Inl o ] «w+«») [6 } 

a(z)=6 + logz (35) 

/3(z)=9 + 21ogz (36) 

L*{z) = 10 45 z 2 , (37) 

where L*(z) approximately marks the location of the peak in /(logL|z), t(z) is the age of the universe in Gyr at 
redshift z, a(z) is the slope of log /(log L\z) for L < L*(z), and (3(z) is the slope of log/(logL|z) for L > L*(z). In 
this simulated 'universe', both the peak and logarithmic slopes of the LF evolve. The form of the luminosity function 
assumed by Equation (|34[) is similar to the double power-law form commonly used in the quasar LF literature, but 
has a more gradual transition between the two limiting slopes. 
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Fig. 5. — The distribution of L and z for the simulated sample described in § 16,11 Red dots denote sources included in the sample, 
black dots denote sources not included in the sample, the blue line denotes L* as a function of z, and the green contours display the 2-d 
luminosity function. Also shown are histograms of the marginal distributions of logL and z, for all simulated objects (black histogram) 
and only the detected ones (red histogram). For clarity, the histogram of the detected sources has been forced to peak at a value equal to 
half of the peak of the histogram of all objects. 

After using Equations ([55]) and to generate random values of L and z, we simulated the effects at a selection 
function. We randomly kept each source for z < 4.5, where the probability of including a source given i ts lu minosity 
and redshift was taken to be the SDSS DR3 Quasar selection function, as reported by iRichards et all (|2006f ). After 
running our simulated sample through the selection function, we were left with a sample of n ~ 1300 sources. Therefore, 
our simulated survey is only able to detect ~ 0.4% of the N — 3 x 10 5 total quasars in our simulated 'universe'. The 
distributions of L and z are shown in Figure [5] for both the detected sources and the full sample. As can be seen, the 
majority of sources are missed by our simulated survey. 

The joint probability distribution of L and z is f(L,z) = f{L\z)f(z) 1 and therefore Equations (|33[) and (|34[) imply 
that the true LF for our simulated sample is 

ML,z) = {%) /0og£|2)/(logz) (38) 

Figure O shows (j>o(L, z) at several redshifts. Also shown in Figure [5] is the best fit for a mixture of K = 4 Gaussian 
functions. Despite the fact that (f>o(L,z) has a rather complicated parametric form, a mixture of four Gaussian 
functions is sufficient to achieve an excellent approximation to (f>o(L, z); in fact, the mixture of four Gaussian functions 
approximation is indistinguishable from the true LF. 

6.2. Performing Statistical Inference on the LF with the MCMC Output 

We performed the MHA algorithm described in §[5]to obtain random draws from the posterior probability distribution 
for our this simulated sample, assuming the Gaussian mixture model described in § 21 We performed 10 4 iterations 
of burn-in, and then ran the Markov chains for 3 x 10 4 more iterations. We ran 20 chains simultaneously in order to 
monitor convergence (e.g., see lGehnan et al1l2004l) and explore possible multimodality in the posterior. We saved the 
values of 9 for the Markov chains after the initial 10 burn-in iterations, and, after removing divergent chains with 
N < we were left with ~8x 10 4 random draws from the posterior distribution, p(9, N\L„t s , z b s )- 

The output from the MCMC can be used to perform statistical inference on the LF. Denote the T~8x 10 4 random 
draws of 6 and N obtained via the MHA as 9 1 , . . . , 9 T and N 1 , . . . , N T , respectively. The individual values of (#*, TV*) 
can then be used to construct histograms as estimates of the posterior distribution for each parameter. For each 
random draw of 9 and N, we can also calculate a random draw of 4>(L, z) from its posterior distribution. In particular, 
the t th random draw of the LF under the mixture of normals model, denoted as 0*(L,z), is calculated by inserting 
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Fig. 6. — The true LF (solid red line) at several values of z, and the best K = 4 Gaussian function fit (dashed black line). In this case, 
approximating the LF with four 2-dimensional Gaussian functions provides an excellent fit. 



0* and AT* into Equation JTSJ). The T values of <ft{L,z) can then be used to estimate the posterior distribution of 
4>(L, z) for any given value of L and z. Furthermore, random draws from the posterior for quantities that are computed 
directly from the LF, such as the location of its peak as a function of z, are obtained simply by computing the quantity 
of interest from each of the T values of 0*(L, z). 

In Figures [7J and [8] we show <^(log L, z) at several different redshifts, on both a linear scale and a logarithmic scale. In 
general, we find it easier to work with 4>(\og L, z) = In 10L<j)(L, z), as <fi(log L, z) can span several orders of magnitude 
in L. Figures [7] and [8] show the true value of the LF, 4>o(logL, z), the best-fit estimate of 4>(\ogL,z) based on the 
mixture of Gaussian functions model, and the regions containing 90% of the posterior probability. Here, as well as 
throughout this work, we will consider the posterior median of any quantity to be the 'best-fit' for that quantity. In 
addition, in this work we will report errors at the 90% level, and therefore the regions containing 90% of the posterior 
probability can be loosely interpreted as asymmetric error bars of length ss 1.65a. The region containing 90% of the 
probability for cj)(logL, z) is easily estimated from the MCMC output by finding the values of t\ and ti such that 90% 
of the values of <^ 1 (log L, z), . . . , (j> T (logL, z) have 0* 1 (log L, z) < cjf (log L, z) < <fi t2 (log L, z). As can be seen, the true 
value of </>(logL,z) is contained within the 90% probability region for all almost values of L, even those below the 
survey detection limit. 

Figure [9] compares the true integrated z < 6 number distribution of logL, n(logL, z < 6), with the mixture of 
Gaussian functions estimate. The quantity n(logL, z < 6)d\ogL gives the number of quasars at z < 6 with black hole 
masses between logL and logL + dlogL. It is calculated as 



n(logL, z < 6) 
which, for the mixture of normals model, is 



cj)(\ogL,z) 



dV 

dz 



dz. 



n(logL, z < z ) - 



K 

= JV5^7r*JVCIogL|/n ) fc,af ffc )$ 

k=l 



logzo - El\ogz\L, k) 
y/Var(logz\L, k) 



(39) 



(40) 
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Fig. 7. — The true LF (solid red line) at several redshifts for the simulated sample described in § 16,11 The axis labels are the same for all 
panels, but for clarity we only label the bottom left panel. Also shown is the posterior median estimate of the LF based on the mixture of 
Gaussian functions model (dashed blue line), the region containing 90% of the posterior probability (shaded region). The bayesian mixture 
of Gaussian functions model is able to accurately constrain the LF, even below the survey detection limit. 



E(logz\L, k) = fj. Ztk + < ^ L (logL - fj, l k ) (41) 
a i 

Var{\ogz\L,k) = al k -^. (42) 

a z,k 

Here, $(•) is the cumulative distribution function for the standard normal density. Similar to Figures [7] and [HI the 
true value of n(logL, z < 6) is contained within the 90% probability region for all values of L, even those below the 
survey detection limit. 

In addition, in Figure [5] we show the comoving number density of broad line AGN as a function of rcdshift, n(z). 
This is obtained by integrating <fi(L, z) over all possible values of L. For the mixture of normals model, this becomes 

dV^ 1 K 



n(z)=N(^) $> fe p(z|fc), (43) 

where the marginal distribution of z\k is 



fc=i 



Pim = ^exp{-I (^JhA 2 ) . (44) 



z\nlQ^2TTa 2 zk { 1 V **,k 

As before, the true value of n(z) is contained within the 90% probability region, despite the fact that the integration 
extends over all L, even those below the detection limit. The wider confidence regions reflect additional uncertainty 
in n(z) resulting from integration over those L below the detection limit. In particular, the term dV/dz becomes small 
at low redshift, making the estimate of n(z) more unstable as z — > 0, and thus inflating the uncertainties at low z. 

Two other potentially useful quantities are the comoving luminosity density for quasars, Pl(z), and its derivative. 
The comoving quasar luminosity density is given by Pl(z) = L<p(L, z) dL. For the mixture of Gaussian functions 
model it may be shown that 



dV^ 1 K 



P l{z) = n"" ' ' '" 

k=i 



^7r fc p(z|/c)exp jlnlO£(logL|z,/c) + ( ln ^°) Var{logL\z,k)\ (45) 
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Fig. 8. — Same as Figure[7] but shown with a logarithmic stretch. 
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Fig. 9. — The integrated z < 6 quasar number density (number per logL interval, left) and the quasar comoving quasar number density 
as a function of z (number per mpc 3 , right) for the simulated sample described in ij l6.ll As with Figure [7] the solid red line denotes the true 
value for the simulation, the dashed blue line denotes the posterior median for the mixture of Gaussian functions model, and the shaded 
region contain 90% of the posterior probability. The posterior median provides a good fit to the true values, and the uncertainties derived 
from the MCMC algorithm based on the Gaussian mixture model are able to accurately constrain the true values of these quantities, 
despite the flux limit. 



E(log L\z, k) = m,k + (log z - pL z< k) (46) 

a z,k 
2 

Var{logL\z,k)=al k -^, (47) 

a z,k 

where p(z\k) is given by Equation ([4^|) . We calculate the derivative of Pl{ z ) numerically. Figure [TU1 compares the true 
values of Pl{z) and its derivative with the posterior distribution for Pl(z) inferred from the mixture model, both as 
a function of z and the age of the universe at redshift z, t(z). Comparison with Figure O reveals that the comoving 
quasar luminosity density, pl(z), is a better constrained quantity than the comoving quasar number density, n{z). 
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Fig. 10. — Comoving quasar luminosity density (top two panels) and its derivative (bottom two panels), shown as a function of redshift 
(left two panels) and cosmic age (right two panels) for the simulated sample described in § 16.11 The plotting symbols are the same as 
in Figure [9] As in the previous figures, the Gaussian mixture model is able to provide an accurate fit to the true values of Pl(z), and 
the bayesian MCMC approach is able to provide accurate constraints on Pl(z) and dp^/dz, despite the fact that the integral used for 
calculating these quanties extends below the survey detection limit. 



Furthermore, n{z) appears to peak much later than Pl{ z )- I n addition, we can correctly infer that the comoving quasar 
luminosity density reaches it point of fastest growth at t(z) ~ 2 Gyr, and its point of fastest decline at t(z) ~ 5 Gyr. 

Figure [TT1 quantifies the suggestion that n(z) peaks later than pl(z) by displaying the posterior distribution for the 
location of the respective peaks in n(z) and pl(z). We can still constrain the peak in n(z) to be at z < 0.5. In contrast, 
the location of the peak in pl{z) is constrained to occur earlier at 1 < z < 3. This is a consequence of the fact that 
while there were more quasars per comoving volume element in our simulated universe at z < 0.5, their luminosities 
were much higher at higher redshift. This evolution in characteristic L is quantified in Figure [T2l which summarizes 
the posterior distribution for the location of the peak in </>(logL, z) as a function of redshift and t(z). As can be seen, 
the location of the peak in the LF shows a clear trend of increasing 'characteristic' L with increasing z, although there 
is considerable uncertainty on the actual value of the location of the peak. 

6.3. Using the MCMC Output to Evaluate the LF Fit 

Throughout this section we have been analyzing the MCMC results by comparing to the true LF. However, in 
practice we do not have access to the true LF, and thus a method is needed for assessing the qua lity of the fit. The 
statistical model may be chec ked using a technique known as posterior predictive checking (e.g.. lRubinlll98ll Il984t 
iGelman. Meng. fc Sternlll998h . Here, the basic idea is to use each of the MCMC outputs to simulate a new random 
observed data set. The distributions of the simulated observed data sets are then compared to the true observed data 
in order to assess whether the statistical model gives an accurate representation of the observed data. It is important 
to construct simulated data sets for each of the MCMC draws in order to incorporate our uncertainty in the model 
parameters. 

For each value of N* and 9 t obtained from the MCMC output, a simulated data set of (Z* bs , z l obs ) may be obtained 
through a similar procedure to that described in § 16.11 First, one draws a value of Nk from a binomial distribution 
with iV* trials and probability of 'success' p = Q/4n. Then, one draws Nq values of L and z l from p(L, z\6 l ). 

For our model, p(logL,logz|f?*) is a mixture of normal densities, and one needs to employ a two-step process in 
order to simulate a random value from p(logL,logz|#*). First, one needs to randomly assign the i th data point 
to one of the Gaussian distributions. Since irk gives the probability that a data point will be drawn from the fc th 
Gaussian distribution, one first needs to simulate a random vector G* from a multinomial distribution with one trial 
and probability of success for the fc th class 7r£; i.e., first draw G* ~ Multinom(l, ir\, . . . ,7ri-). The vector G* gives 
the class membership for the i th data point, where G\ k — 1 if the i th data point comes from the k th Gaussian, and 
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Location of Peak in Quasar n(z) Location of Peak in Quasar p L (z) 

Fig. 11. — Posterior distribution for the redshift location of the peak in the como ving number density of quasars (left) and the peak in the 
comoving quasar luminosity density (right) for the simulated sample described in § 16,11 For clarity we only show the posterior distribution 
for the peak in n(z) at z > 0.5, since values of the peak at z < 0.5 arise because the term (dV/dz) -1 becomes very large at low z. The 
vertical lines denote the true values. The posterior distribution inferred from the MCMC output is able to accurately constrain the true 
values of the argumentative maximum in n(z) and p^(z). 




Redshift Age of Universe 



Fig. 12. — Location of the peak in the LF as a function of z (left) and cosmic age (right) for the simulated sample described in § 16.11 The 
plot symbols are the same is in Figure [9] In general the posterior median of the Gaussian mixture model provides a good estimate of the 
true peak locations, although the uncertainty is high due to the survey flux limit. However, it is clear from these plots that the location of 
the peak in </>(£, z) evolves. 



G'- = if j ' ^ k. Then, given G\ k = 1, one then simulates a value of (log L\, log z\) from a 2-dimensional Gaussian 
distribution with mean ji k and covariance matrix EjS.. This is repeated for all Nq sources, leaving one with a random 
sample (log I/*, logz 4 ) ~ p(logL, logz|0 4 ). 

A random draw from Multinom(l, tti, . . . ,ttk), may be obtained as a sequence of binomial random draws. First, 
draw rii ~ Binomial(l, tti). If n! x — 1, then assign the data point to the first Gaussian distribution, i.e., set Gn = 1. 

If n'i — 0, then draw n' 2 ~ Binomial(l, 712/ X)fe=2 If n 2 — 1> then assign the data point to the second Gaussian 
distribution, i.e., set Gi2 = 1. If = 0, then the process is repeated for the remaining Gaussian distribution as 
follows. For j = 3, . . . , K — 1, sequentially draw n'j ~ Binomial(l, ttj/ Ylk=j 7r ' £ )- ^ a ^ an y time n'j = 1, then stop the 
process and assign the data point to the j th Gaussian distribution. Otherwise, if none of the n'^ — 1, then assign the 
data point to the K th Gaussian distribution. 

Once one obtains a random draw of randomly 'observe' these sources, where the probability of including 

a source given L\ and z\ is given by the selection function. This will leave one with a simulated observed data set, 
(I/ 4 ^, zl bs ). This process is repeated for all T values of N l and l obtained from the MCMC output, leaving one with 
T simulated data sets of (L t obs , z* bs ). One can then compare the distribution of the simulated data sets with the true 
values of L t s , and z t s to test the statistical model for any inconsistencies. 

In Figure \T3\ we show histograms for the true observed distributions of z and logL. These histograms are compared 
with the posterior median of the distributions based on the mixture of Gaussian functions model, as well as error bars 
containing 90% of the simulated values. Also shown is a plot comparing the true values of the maximum of L Q b s as 
a function of z with those based on L l obs and z l obs . As can be seen, the distributions of the observed data assuming 
the mixture of Gaussian functions model are consistent with the true distributions of the observed data, and therefore 
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Fig. 13. — Posterior predictive check for the Gaussian mixture model (see § 16.31 1. The histograms show the actual distributions of L t, 3 
and £063, the red squares denote the posterior medians for the number of sources in each respective bin, and the error bars contain the 
inner 90% of the histogram values for the samples simulated from the posterior. Also shown is a plot of the maximum observed luminosity 
as a function of z for the simulated samples, the red squares mark the value from the actual sample used in the fit, and the error bars 
contain 90% of the values simulated from the posterior distribution. The mixture of Gaussian functions model is able to provide an accurate 
prediction of the observed distribution of luminosity, redshift, and line widths, and thus there is not any evidence to reject it as providing 
a poor fit. 

there is no reason to reject the mixture model as providing a poor fit. 

7. SUMMARY 

We have derived the observed data likelihood function which relates the quasar LF to the observed distribution of 
rcdshifts, luminosities. This likelihood function is then used in a Bayesian approach to estimating the LF, where the 
LF is approximated as a mixture of Gaussian functions. Because much of this work was mathematically technical, we 
summarize the important points here. 

• Equation [5] gives the likelihood function for an assumed parametric luminosity function. This likelihood function 
differs from the Poisson likelihood commonly used in the LF literature because it correctly models the sample 
size as a binomial random variable, whereas the Poisson likelihood approximates the sample size as a Poisson 
random variable. In practice, the difference in the maximum-likelihood estimates obtained from the two likelihood 
functions do not seem to be significantly different so long as the probability of including a source in a survey is 
small. 



The product of Equations (|12[) and (fT5)) is the joint posterior probability distribution of the LF, given the 
observed data. These equations may be used to perform Bayesian inference on the LF, after assuming a prior 
distribution on the LF parameters. Bayesian inference is often most easily performed by simulating random 
variables drawn from the posterior probability distribution. These random draws may be used to estimate the 
posterior distribution for the LF, as well as to estimate the posterior distribution for any quantities calculated 
from the LF. The posterior distribution provides statistically accurate uncertainties on the LF and related 
quantities, even when the sample size is small and one is including information below the survey detection limits. 
In contrast, confidence intervals derived from bootstrapping the maximum-likelihood estimate can be too small. 

We describe a flexible model for the LF, where the LF is modeled as a mixture of Gaussian functions. Equation 
(|17p describes the probability distribution of log L and log z under the mixture of Gaussian functions model, and 
(fT5)) describes the LF under the mixture of Gaussian functions model. 

Equation (|21[) gives our prior distribution for the Gaussian function parameters. The marginal posterior distri- 
bution of the mixture model parameters is given by Equation (|22p . the conditional posterior distribution of ./V 
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at a given 9 is given by Equation (fT3|) , and the complete joint posterior distribution is the product of Equations 
([22]) and (fl3|). 

• We describe in § a Metropolis-Hastings algorithm for obtaining random draws from the posterior distribution 
for the LF assuming a Schechter function or mixture of Gaussian functions model. In § [5J we use a simulated 
sample, modeled after the SDSS DR3 Quasar Catalog, to illustrate the effectiveness of our statistical method, 
as well as to give an example on how to use the Metropolis-Hastings output to perform statistical inference on 
the LF and assess the LF fit. 

So long as the mixture of Gaussian functions model is an accurate approximation to the true LF over all luminosities, 
the uncertainties on the LF, assuming the Gaussian mixture model, are trustworthy because they are calculated directly 
from the probability distribution of the LF, given the observed data. Statistical inference on the LF below the flux 
limit can become significantly biased if one assumes an incorrect and restrictive parametric form, as extrapolation 
errors can become large. In this case, the derived posterior may not contain the true LF because the wrong parametric 
model was assumed; this type of error is known as model misspecification. For example, consider a case when the true 
LF is a Schechter function, and one is only able to detect sources brighter than L*. If one were to assume a power- law 
for the LF, extrapolation below the flux limit would be significantly biased. In contrast, the mixture of Gaussian 
functions model, while incorrect, is flexible enough to accurately approximate the true Schechter function form, thus 
minimizing extrapolation bias due to model misspecification. Of course, in this example, the most accurate results 
would be obtained by fitting a Schechter function to the data, since it is the correct parametric form. Therefore, the 
mixture of Gaussian functions model will not perform as well as assuming the correct parametric model, or at least as 
well as an alternative parametric model that better approximates the true LF. 

Although we have focused on the mixture of Gaussian functions model, the likelihood and posterior distribution are 
applicable for any parametric form, as illustrated in § 13.21 The observed data likelihood function for the LF is given 
by Equation ([6]), and the posterior distribution is given by the product of Equations (|12p and (|13p . Then, one can 
use Equation {T]) to 'plug-in' any parametric form of the LF into the appropriate likelihood function and posterior 
distribution, as was done in Equation (|15| for a Schechter function. In addition, the metropolis-hastings algorithm is 
a general method for obtaining random draws from the posterior, and can be developed for any parametric form of 
the LF. This therefore allows one to perform Bayesian inference for any variety of parametric models of the LF, and 
one is not merely limited to the mixture of Gaussian functions model or Schechter function considered in this work. 

An IDL computer routine written for performing the Metropolis-Hastings algorithm for the mixture of Gaussian 
functions model is available on request from B. Kelly. 

We acknowledge support from NSF grant AST 03-07384 and a David and Lucile Packard Fellowship in Science and 
Engineering. 

APPENDIX 

MAXIMUM-LIKELIHOOD VS BAYESIAN INFERENCE 

In this section we compare the maximum likelihood approach with the Bayesian approach. We do this for readers 
who are unfamiliar with some of the more technical aspects of the two approaches, with the hope that the discussion 
in this section will facilitate interpretation of our results in the main body of the paper. 

In maximum-likelihood analysis, one is interested in finding the estimate that maximizes the likelihood function 
of the data. For a given statistical model, parameterized by 9, the likelihood function, p(x\9), is the probability of 
observing the data, denoted by x, as a function of the parameters 9. The maximum-likelihood estimate, denoted by 
9, is the value of 9 that maximizes p(x\9). Under certain regularity conditions, 9 enjoys a number of useful properties. 
In particular, as the sample size becomes infinite, 9 becomes an unbiased estimate of 9. An unbiased estimator is an 
estimator with expectation equal to the true value, i.e., E{9) = 9q, where 9q is the true value of 9. Therefore, on 
average, an unbiased estimator will give the true value of the parameter to be estimated. 

Because the maximum likelihood estimate is a function of the data, 9 has a sampling distribution. The sampling 
distribution of 9 is the distribution of 9 under repeated sampling from the probability distribution of the data. Under 
certain regularity conditions, the sampling distribution of 9 is asymptotically normal with covariance matrix equal to 
the average value of the inverse of the Fisher information matrix, 1(9), evaluated at 9q. The Fisher information matrix 
is the expected value of the matrix of second derivatives of the log-likelihood, multiplied by —1. Formally, this result 
states that as n — > oo, then 

9~N p (9 ,I- 1 (9 )), (Al) 
I{0) = -E(J^lnp(x\0)\ (A2) 

where p is the number of parameters in the model, and the expectation in Equation (|A2|1 is taken with respect to 

the sampling distribution of x, p(x\9o). Because we do not know 9q, it is common to estimate I^ 1 (9q) by I~ 1 (9). In 
addition, it is common to estimate 1(9) as the matrix of second derivatives of the log-likelihood of one's data, since 
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the sample average is a consistent estimate for the expectation value. Qualitatively, Equation (|Al[) states that as the 
sample size becomes large, 9 approximately follows a normal distribution with mean 9q and covariance matrix I~ (9). 
This fact may be used to construct confidence intervals for 9. 

While the asymptotic results are useful, it is not always clear how large of a sample is needed until Equation (|Alj) is 
approximately true. The maximum likelihood estimate can be slow to converge for models with many parameters, or 
if most of the data is missing. Within the context of luminosity function estimation, the maximum-likelihood estimate 
will be slower to converge for surveys with shallower flux limits. In addition, Equation (|A1| does not hold if the 
regularity conditions are not met. In general, this is not a concern, but it is worth noting that the asymptotics do 
not hold if the true value of 9 lies on the boundary of the parameter space. For example, in the case of a Schechter 
luminosity function, if the true value of the shape parameter, a (see [14]), is a$ — — 1, then Equation (|Aip does not 
hold, since a > —1. If ao ~ — 1, then Equation (lAlj) is still valid, but it will take a large sample before the asymptotics 
are valid, as ao hes near the boundary of the parameter space. 

In Bayesian analysis, one attempts to estimate the probability distribution of the model parameters, 9, given the 
observed data x. The probability distribution of 9 given x is related to the likelihood function as 

p(9\x) (xp(x\9)p{6). (A3) 

The term p(x\9) is the likelihood function of the data, and the term p(9) is the prior probability distribution of 9; 
the result, p{9\x) is called the posterior distribution. The prior distribution, p{9), should convey information known 
prior to the analysis. In general, the prior distribution should be constructed to ensure that the posterior distribution 
integrates to one, but to not have a significant effect on the posterior. In particular, the posterior distribution should 
not be sensitive to the choice of prior distribution, unless the prior distribution is constructed with the purpose of 
placing constraints on the posterior distribution that are not conveyed by the data. The contribution of the prior to 
p(9\x) becomes negligible as the sample size becomes large. 

From a practical standpoint, the primary difference between the maximum likelihood approach and the Bayesian 
approach is that the maximum likelihood approach is concerned with calculating a point estimate of 9, while the 
Bayesian approach is concerned with mapping out the distribution of 9. The maximum likelihood approach uses 
an estimate of the sampling distribution of 9 to place constraints on the true value of 9. In contrast, the Bayesian 
approach directly calculates the probability distribution of 9, given the observed data, to place constraints on the true 
value of 9. It is illustrative to consider the case when the prior is taken to be uniform over 9; assuming the posterior 
integrates to one, the posterior is then proportional to the likelihood function, p{9\x) oc p(x\9). In this case, the goal 
of maximum likelihood is to calculate an estimate of 9, where the estimate is the most probable value of 9, given 
the observed data. Then, confidence intervals on 9 are derived from the maximum likelihood estimate, 9, usually by 
assuming Equation (jAlj) . In contrast, the Bayesian approach is not concerned with optimizing the likelihood function, 
but rather is concerned with mapping out the likelihood function. Under the Bayesian approach with a uniform prior, 
confidence intervals on 9 are derived directly from likelihood function, and an estimate of 9 can be defined as, for 
example, the value of 9 averaged over the likelihood function. So, the maximum likelihood attempts to obtain the 
'most likely' value of 9, while the Bayesian approach attempts to directly obtain the probability distribution of 9, given 
the observed data. Because the Bayesian approach directly estimates the probability distribution of 9, and because 
it does not rely on any asymptotic results, we consider the Bayesian approach to be preferable for most astronomical 
applications. 

DERIVATION OF THE MARGINAL POSTERIOR DISTRIBUTION FOR TRUNCATED DATA 

Here, we give a derivation of the posterior probability distribution of 9, given by Equation (TT2"]) . If we assume a 
uniform prior on log TV, then this is equivalent to assuming the prior p(9,N) oc N~ 1 p(9). In this case, the posterior 
distribution is given by 

p(6,N\L obs ,z obs ) <xN- 1 p(0)CZ\p(I = O\0)] It - n JJ p(£i,*|0). (Bl) 

i£A obs 

The marginal posterior distribution of 9 is obtained by summing the joint posterior over all possible values of N. For 
the choice of prior p(9, log N) oc p(9), the marginal posterior of 9 is 



p(9\L obs , z obs )cxp(9) 



n p(^'- 



JeAobs 
( xp(9)[p(I=l\9)Y 



J2N- 1 Cn P (I^0\0)] N - n (B2) 

OO 



n p( L - 

.ieAab S 



N=n 



where we arrived at the second Equation by multiplying and dividing the first Equation by p(I — l\9) n and noting that 
C% = C„Si(N/n). The term within the sum is the mathematical expression for a negative binomial distribution as a 
function of N (sec Eq. |Cl| ). Because probability distributions must be equal to unity when summed over all possible 
valu es, the sum is just equal to one. We therefore arrive at Equation (fl"2"|) by replacing the summation in Equation 
(IB3I) with the value of one. 
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SOME PROBABILITY DISTRIBUTIONS USED IN THIS WORK 



In this section of the appendix we briefly describe some probability distribution that we employ, but may be unfamiliar 
to some astronomers. 

Negative Binomial 

The negative binomial distribution is closely related to the binomial distribution. The binomial distribution gives the 
probability of observing n 'successes', given that there have been ./V trials and that the probability of success is p. In 
contrast, the negative binomial distribution gives the probability of needing N trials before observing n successes, given 
that the probability of success is p. Within the context of this work, the binomial distribution gives the probability of 
detecting n sources, given that there are N total sources and that the detection probability is p. The negative binomial 
distribution gives the probability that the total number of sources is N, given that we have detected n sources and 
that the detection probability is p. The negative binomial distribution is given by 

p(N\n,p) = C»S l 1 p n (l- P ) N - n , N>n. (CI) 

A random draw from the negative binomial distribution with parameters n and p may be simulated by first drawing 
n random values uniformly distributed on [0,1], u\, . . . , u n ~ Uniform(0, 1). Then, calculate the quantity 



m 



E 

i=l 



log Ui 



log(l - p) 



(C2) 



where [-J is the floor function, i.e., [a; J denotes the greatest integer less than or equal to x. The quantity N = n + m 
will then follow a negative binomial distribution with parameters n and p. 

Dirichlet 

The Dirichlet distribution is a multivariate generalization of the Beta distribution, and it is commonly used when 
modeling group proportions. Dirichlet random variables are constrained to be positive and sum to one. The Dirichlet 
distribution with argument 9i, ... ,6k and parameters a±, . . . , otk is given by 



p(0 1 



, 9k\ai 



T(ai 



CUfc) 



,0k > 0, ai, . . . ,a k > 0, 



k 

E< 

i=l 



= 1. 



(C3) 



To draw a random value 6\, . . . ,9 k from a Dirichlet distribution with parameters a\, . . . ,a k , first draw x\, . . . , Xk 
independently from Gamma distributions with shape parameters a\ , . . . , a.k and common scale parameter equal to 
one. Then, set 6j — Xj/^2 i=1 X{. The set of 9 will then follow a Dirichlet distribution. 

Multivariate Student-t and Cauchy Distribution 

The Student-i distribution is often used as a robust alternative to the normal distribution because it is more heavily 
tailed than the normal distribution, and therefore reduces the effect of outliers on statistical analysis. A t distribution 
with v = 1 degree of freedom is referred to as a Cauchy distribution, and it is functionally equivalent to a Lorentzian 
function. A p-dimensional multivariate t distribution with p-dimensional argument x, p-dimensional mean vector p., 
p x p scale matrix S, and degrees of freedom v is given by 

1 -(f+p)/2 



P(x|m,S,i/) 



r((f + p)/2) 

r(^/2)^/ 2 7rP/ 2 



-1/2 



1 



1 + -(x-m) j E-^x-m) 



(C4) 



The 1-dimensional t distribution is obtained by replacing matrix and vector operations in Equation (|C4j) with scalar 
operations. 

Although we do not simulate from a t distribution in this work, for completeness we include how to do so. To 
simulate a random vector t from a multivariate t distribution with mean vector \i, scale matrix E, and degrees of 
freedom v, first draw z from a zero mean multivariate normal distribution with covariance matrix E. Then, draw x 
from a chi-square distribution with v degrees of freedom, and compute the quantity t = /x + zy/ v/x. The quantity t 
is then distributed according to the multivariate t distribution. 

Wishart and Inverse Wishart 

The Wishart distribution describes the distribution of the pxp sample covariance matrix, given the pxp population 
covariance matrix, for data drawn from a multivariate normal distribution. Conversely, the inverse Wishart distribution 
describes the distribution of the population covariance matrix, given the sample covariance matrix, when the data are 
drawn from a multivariate normal distribution. The Wishart distribution can be thought of as a multivariate extension 
of the x 2 distribution. A Wishart distribution with pxp argument S, p x p scale matrix E, and degrees of freedom v 
is given by 



p(S|£,i/) 



2"p/ 2 7r P(p- 1 )/ i Y[Y 



i=l 



1 



| Er ,/2| 5 |(,- P -l)/2 exp |_l^ (s -l 5) | ; 



(C5) 
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where the matrices S and £ are constrained to be positive definite. An inverse Wishart distribution with pxp argument 
S,pxp scale matrix S, and degrees of freedom v is 



P (ns,v) 



2"p/ 2 7T p(p- 



l)/4 



1 



| 5 |./2| S |- (l ,+ P +l)/2 exp J tr (p-lS) 



1 



(C6) 



where the matrices S and £ are constrained to be positive definite. 

To draw a, pxp random matrix from a Wishart distribution with scale matrix T, and v degrees of freedom, first draw 
Xi, . . . ,x„ from a zero mean multivariate normal distribution with pxp covariance matrix S. Then, calculate the 
sum S = Yl'j—i x i x r ■ The quantity S is then a random draw from a Wishart distribution. Note that this technique 
only works when v > p. A random draw from the inverse Wishart distribution with scale matrix S and degrees of 
freedom v may be obtained by first obtaining a random draw W from a Wishart distribution with scale matrix S^ 1 
and degrees of freedom v. The quantity E = W~ x will then follow an inverse Wishart distribution. 
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